Pay-as-you-go Configuration of Entity Resolution
Abstract
Entity resolution, which seeks to identify records that represent the same entity, is an important step in many data integration and data cleaning applications. However, entity resolution is challenging both in terms of scalability (all-against-all comparisons are computationally impractical) and result quality (syntactic evidence on record equivalence is often equivocal). As a result, end-to-end entity resolution proposals involve several stages, including blocking to efficiently identify candidate duplicates, detailed comparison to refine the conclusions from blocking, and clustering to identify the sets of records that may represent the same entity. However, the quality of the result is often crucially dependent on configuration parameters in all of these stages, for which it may be difficult for a human expert to provide suitable values. This paper describes an approach in which a complete entity resolution process is optimized, on the basis of feedback (such as might be obtained from crowds) on candidate duplicates. Given such feedback, an evolutionary search of the space of configuration parameters is carried out, with a view to maximizing the fitness of the resulting clusters. The approach is pay-as-you-go in that more feedback can be expected to give rise to better outcomes. An empirical evaluation shows that the co-optimization of the different stages in entity resolution can yield significant improvements over default parameters, even with small amounts of feedback.
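The idea of searching the configuration space against feedback can be illustrated with a minimal Python sketch. This is a toy under stated assumptions, not the paper's implementation: the function names (`resolve`, `fitness`, `mutate`, `evolve`), the two-parameter configuration (`block_len`, `threshold`), prefix blocking, `difflib` string similarity, and connected-component clustering are all hypothetical stand-ins chosen for brevity. Feedback is assumed to be a list of `(i, j, is_duplicate)` judgments on record pairs.

```python
import random
from difflib import SequenceMatcher
from itertools import combinations

def resolve(records, block_len, threshold):
    """Toy pipeline: prefix blocking, pairwise comparison, then
    connected-component clustering. Returns one cluster label per record."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Blocking: records sharing a lowercase prefix become candidates.
    blocks = {}
    for idx, rec in enumerate(records):
        blocks.setdefault(rec[:block_len].lower(), []).append(idx)

    # Detailed comparison within each block; merge pairs above the threshold.
    for members in blocks.values():
        for a, b in combinations(members, 2):
            sim = SequenceMatcher(None, records[a].lower(), records[b].lower()).ratio()
            if sim >= threshold:
                parent[find(a)] = find(b)
    return [find(i) for i in range(len(records))]

def fitness(config, records, feedback):
    """Fraction of feedback pairs whose judgment agrees with the clustering."""
    labels = resolve(records, *config)
    agree = sum((labels[a] == labels[b]) == same for a, b, same in feedback)
    return agree / len(feedback)

def mutate(config, rng):
    """Perturb both parameters, keeping them within assumed bounds."""
    block_len, threshold = config
    block_len = min(5, max(0, block_len + rng.choice([-1, 0, 1])))
    threshold = min(1.0, max(0.0, threshold + rng.uniform(-0.1, 0.1)))
    return block_len, threshold

def evolve(records, feedback, default, generations=20, pop_size=8, seed=0):
    """Simple elitist evolutionary search starting from a default configuration."""
    rng = random.Random(seed)
    population = [default] + [mutate(default, rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=lambda c: fitness(c, records, feedback), reverse=True)
        survivors = population[: pop_size // 2]  # the best half is kept unchanged
        population = survivors + [mutate(rng.choice(survivors), rng) for _ in survivors]
    return max(population, key=lambda c: fitness(c, records, feedback))
```

A possible usage, with made-up records and feedback: `evolve(["John Smith", "Jon Smith", "Mary Jones", "Marie Jones"], [(0, 1, True), (2, 3, True), (0, 2, False)], default=(3, 0.95))` returns a `(block_len, threshold)` pair. Because the best half of each generation survives unchanged, the returned configuration never scores worse on the feedback than the default it started from. Adding more feedback pairs changes only the `fitness` signal, which is the sense in which such a search is pay-as-you-go.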
Similar resources
Financing Long-term Care: Some Ideas From Switzerland; Comment on “Financing Long-term Care: Lessons From Japan”
Ikegami reviews the implementation of mandatory long-term care insurance systems in Germany and Japan, which are organized as pay-as-you-go systems. I propose to go one step further and implement a multi-pillar, mandatory and voluntary long-term care financing system, which combines pay-as-you-go with capital-funded elements. The proposal is based on the observation tha...
Minoan ER: Progressive Entity Resolution in the Web of Data
Entity resolution aims to identify descriptions of the same entity within or across knowledge bases. In this work, we present the Minoan ER platform for resolving entities described by linked data in the Web (e.g., in RDF). To reduce the required number of comparisons, Minoan ER performs blocking to place similar descriptions into blocks and executes comparisons to identify matches only between...
Pay-as-you-go Data Integration: Experiences and Recurring Themes
Data integration typically seeks to provide the illusion that data from multiple distributed sources comes from a single, well-managed source. Providing this illusion in practice tends to involve the design of a global schema that captures the users' data requirements, followed by manual (with tool support) construction of mappings between sources and the global schema. This overall approach can...
Bcfg2: A Pay As You Go Approach to Configuration Complexity
While configuration management tools are an area of substantial research and development, tool adoption has lagged behind. We assert that lack of adoption is caused in large part by complexity costs. We will describe bcfg2, a configuration management tool, and its approach to complexity mitigation.
C3D+P: A summarization method for interactive entity resolution
Entity resolution is a fundamental task in data integration. Recent studies of this problem, including active learning, crowdsourcing, and pay-as-you-go approaches, have started to involve human users in the loop to carry out interactive entity resolution tasks, namely to invite human users to judge whether two entity descriptions refer to the same real-world entity. This process of judgment re...
Journal title:
- T. Large-Scale Data- and Knowledge-Centered Systems
Volume 29
Publication date: 2016